server: implement GLM-style MTP #15225

F1LM1 · 2025-08-11T05:52:35Z

This is very much a draft/proof of concept I'm playing with, just one idea for an MTP implementation. Planning to test on GLM-4.5 because it's the only model out there that we've preserved NextN tensors for.

From what I can tell

the three models with MTP implemented in vLLM right now are all "DeepseekV3-style,"
they only have one MTP head, which predicts token at position n+2,
the MTP layers take as input the output embedding from the last conventional layer and their own input embedding.

So implementation-wise it seems like

we should try to reuse the existing speculative decode functionality (including nice stuff like main model KV cache management, various samplers, etc.),
but a lot of the full draft model functionality is redundant/harmful, like context/cache management for the draft model, vocab matching,
it probably makes sense to write a new function like mtp_speculative_gen_draft in speculative.cpp that is vastly simplified and branch into it in server.cpp when a slot has MTP (versus common_speculative_gen_draft).
AFAICT it looks like the server.cpp loop currently alternates between conventional forward pass and draft, which in the MTP case will probably sabotage performance gains (since our max throughput is only 1.5 tok/pass assuming zero rejections, instead of 2 tok/pass). Let me know if this isn't the case!—but if it is, should probably avoid doing non-speculative decodes after the first response token.
It doesn't make sense to have to manage a distinct ctx_dft in this case as well. It's a bit hacky but I was thinking we could just have ctx_dft = ctx and then have both normal and MTP passes write over the shared ctx logits. I think this minimizes required code changes elsewhere

This is my first time (1) working with ML stuff outside of python (2) attempting to contribute, so patience is appreciated :)

…lative.cpp

ggerganov · 2025-08-13T06:58:33Z

AFAICT it looks like the server.cpp loop currently alternates between conventional forward pass and draft, which in the MTP case will probably sabotage performance gains (since our max throughput is only 1.5 tok/pass assuming zero rejections, instead of 2 tok/pass). Let me know if this isn't the case!—but if it is, should probably avoid doing non-speculative decodes after the first response token.

This is correct - we always alternate between conventional and speculative passes. It's definitely not optimal, but improves flexibility for regular sampling. It allows to change the speculative parameters and even disable it per request, while the logic is quite simple.

It should be possible to improve this by keeping track which slots are speculating on each iteration and skip adding tokens to the conventional batch for them. It might be a good idea to implement this separately to avoid huge changes in the logic in a single PR.

ggerganov · 2025-08-13T07:52:04Z

Generally we should try to minimize the changes to llama.h, since changing/extending the public API requires a lot of effort.

On first look, I think the path that involves minimal changes is:

Add int n_mtp flag to llama_context_params (default = 1 - MTP is disabled, 2 - predict logits for one additional token, 3 - predict logits for 2 additional tokens, etc.)
Use this flag during graph build to determine if the MTP heads should be appended to the graph
Keep the conventional logits in the t_logits tensor in llm_graph_result
Add new tensor t_logits_mtp (or whatever is more appropriate) in llm_graph_result and use it to store the MTP results in it
In llama_decode() extract the t_logits_mtp data when available, following the same logic as for t_logits

Extracting the MTP logits during llama_decode() can be done in 2 ways:

Create separate buffer in the llama_context to store them and add a new llama_get_logits_mtp_ith() API that works with that new buffer in a similar way as the existing llama_get_logits_ith()
Reuse the existing logits buffer by expanding it to from [n_outputs][n_vocab] to [n_outputs][n_mtp*n_vocab]. This would avoid the need to add llama_get_logits_mtp_ith() and we can generalize the existing llama_get_logits_ith() by taking into account the value of n_mtp.

Currently, I am not sure which way is better. The first requires a new API call, while the second might break some existing assumptions (not sure if that's the case yet).

In any case, you can avoid this until you get the implementation working with a reasonable speedup. After that, we can discuss further how to best refactor the implementation.

slaren · 2025-08-13T11:36:20Z

Currently, I am not sure which way is better. The first requires a new API call, while the second might break some existing assumptions (not sure if that's the case yet).

I don't see an issue with adding a new API for this, and it would be easier to use.

Juahyori · 2025-08-13T16:46:10Z

Out of curiosity, is the API for this expected to be flexible enough that we could jump off of it to add things like Medusa / Eagle style (or IBM Accelerator) self speculative decoding heads?

I'm pretty sure they work fairly similarly (depending on the final output embeddings of the current token).

Another note:

After some consideration I think the expected speedup of the MTP module will depend a lot on the hardware the model's running on, particularly because it's an MoE model. While the next token prediction depends only on the current state, if we're doing self speculative decoding, that's additional forward passes. Those forward passes aren't guaranteed to have the same expert usage patterns, meaning the speedup should be some function of the tokens predicted and the expert re-use coefficient for the tokens verified.

So, just noting that if it's implemented and there's not a 2x or 3x increase in T/s, it may not be a skill issue on the part of a contributor, but due to the mathematical nature of the calculation.

For people running franken setups with Attention / KV Cache on GPU and MoE FFNs on CPU, it's possible that using previously unused experts in the verification sweep may result in a weird situation where the parallel verification process is actually memory bandwidth bound.

Not to discourage the implementation of this, I just wanted to give a heads up so nobody's dejected if the theoretical speedups can't be hit. There should still be at least some speedup, though.

F1LM1 · 2025-08-14T07:57:37Z

Thanks all for the suggestions. Will definitely look to refactor into something nicer once correctness can be established.
Right now, still trying to get the graph to compute. Turns out reusing the main model's ctx comes with a handful of issues, like the scheduler being weird and not releasing properly.

So, just noting that if it's implemented and there's not a 2x or 3x increase in T/s, it may not be a skill issue on the part of a contributor, but due to the mathematical nature of the calculation.

For people running franken setups with Attention / KV Cache on GPU and MoE FFNs on CPU, it's possible that using previously unused experts in the verification sweep may result in a weird situation where the parallel verification process is actually memory bandwidth bound.

Not to discourage the implementation of this, I just wanted to give a heads up so nobody's dejected if the theoretical speedups can't be hit. There should still be at least some speedup, though.

Yeah, I'd generally recommend that people temper their expectations with this. Especially given these three models only have one MTP head the theoretical performance gain is hard bounded by 2x on the top end, and that's assuming a perfectly efficient implementation and 100% draft acceptance.

In the absence of actual data from a working prototype... I'd probably guess that the implementation after this PR will be on the order of 40% speedup, then up to 80% after completing this:

It should be possible to improve this by keeping track which slots are speculating on each iteration and skip adding tokens to the conventional batch for them. It might be a good idea to implement this separately to avoid huge changes in the logic in a single PR.

Optimistically, I hope to have an ugly but working prototype done sometime today.

F1LM1 · 2025-08-16T03:09:05Z

I've gotten to the point where I can get the MTP head to output stuff but managing KV cache with an external call to a separate MTP graph adds an unbelievable amount of complexity: I think we need to do a forward pass for the MTP layer not just when we're sampling, but for every decode token we run. This goes against the scheduling/batching that we're doing (like we'd probably have to add some form of per-token callback to ggml_backend_sched_compute_splits which just feels like it cannot possibly be the correct approach). It's adding a crazy amount of API bloat as well, as already discussed above.

Think I'll take the principled approach suggested by @ggerganov above and just create a single augmented graph. But on the plus side, from this previous attempt I'm pretty confident the MTP subgraph itself is correct, so it wasn't a total waste of time. 🤪

I'll commit the old branch in a sec in case it ever winds up being useful, but I kind of doubt it (outside of as a reference for constructing the MTP subgraph)

…nt is unreasonable

F1LM1 · 2025-08-16T05:01:26Z

On second thought, building a single augmented graph also doesn't work, because we need the main model's sampled token in the MTP subgraph. We could make some shortcut assumptions, like "greedy sample" in the MTP subgraph, but as soon as we fail to match the actual main model sampled token for the first time, the MTP layer's KV cache is invalid.

Something along the lines of the original approach might work, management of the MTP subgraph's KV cache could be made easier by using cparams.embeddings = true and LLAMA_POOLING_TYPE_NONE, decoding an entire batch, then running the entire batch through the MTP head (discarding outputs) to keep its cache up to date.

F1LM1 · 2025-08-17T09:04:55Z

This commit sort of works, in the sense that it outputs tokens but

I can't guarantee that I didn't break things in the multi-slot case,
the model seems a bit... dumb, like it's still generally coherent but something feels a bit off. I might be subtly messing with the main model's KV cache in some way,
the draft acceptance rate is lower than I'd expect, somewhere around 60% rather than >80%. I suspect this is really another symptom of the same issue causing the second point.

Still enough to make me optimistic that this general approach can be refined into a real implementation and avoids the challenges noted in my last two comments.

F1LM1 · 2025-08-19T06:04:16Z

Okay, I believe this commit "works" in that both main model and MTP output both seem correct under my informal test conditions. The model is now about as coherent as the base model is, at least in basic conversations and solving AIME problems. The typical draft acceptance rate is ~70% with my samplers. Usually gets higher than that for simple responses, code, math; lower for creative writing. Would probably be good to compare versus the vLLM implementation to see if that's around expected (I don't have enough VRAM to run either variant in vLLM sadly)

This implementation is still far from done:

inefficient (all MTP batches are size 1, still alternating between single-token and double-token samples)
kludgey
fragile (probably buggy in multi-slot case)
bad stylistically
~~there seems to be a memory leak~~
doesn't accept prompt >n_ubatch

but despite all of that I think it works as a proof of concept. Even with the implementation as poor as it is right now it's giving like 20-25% performance uplift

ghost · 2025-08-20T16:55:53Z

Tried to run it in RP scenario (using Q4 quant), got from 0.07 to 0.11 acceptance rate on swipes (one time unexpectedly got 0.18) (t=0.8, min p 0.05, top P 0.95). So yeah, probably we shouldn't expect too much from it.

F1LM1 · 2025-08-20T17:01:35Z

Tried to run it in RP scenario (using Q4 quant), got from 0.07 to 0.11 acceptance rate on swipes (one time unexpectedly got 0.18) (t=0.8, min p 0.05, top P 0.95). So yeah, probably we shouldn't expect too much from it.

This sounds like I have a bug with token caching and/or KV shifting on my end, so far haven't been testing in any setting outside of the server webui itself. My caching behavior is almost certainly wrong outside of optimal scenarios right now. Things are still very WIP.

I'll test later but I'd probably expect 50-60% draft acceptance for RP once everything is working, for reference.

F1LM1 · 2025-08-26T05:39:27Z

Upon a bit of testing on my end in RP/creative writing scenarios, I can't find any obvious issues in terms of correctness with the cache management of this prototype; I think the draft head is just unusually bad in a RP setting. I did change the weird logit-hack then standard sample workflow with a greedy sample, which I think is simpler, less hacky, and provides a slight boost to acceptance rates. But I'm still not getting much more than ~40% draft acceptance in RP scenarios with heavy use of advanced samplers like DRY and XTC. I think it'd be good to know if the vLLM implementation yields similar results.

Aside from that, I'll start refactoring. The last correctness issue, which I think may also be the thorniest, is that the MTP cache updates need the embeddings of the main decode pass in order to run, but we also can only store one ubatch's worth of embeddings at once, so the KV cache update step needs to be moved into llama_context::decode or else n_ubatch needs to be set very large (which kills memory usage). However, this can only be done for maintaining the KV cache during prompt processing; after that, we need access to our actual sampled outputs in order to keep updating the MTP head, so these have to be run after common_sampler_sample. Thankfully at that point n_ubatch stops being a constraining factor, so it's possible. And of course things in general need to be cleaned up significantly

Stealt91 · 2025-08-26T07:53:54Z

Not an expert by any means, but XTC is unlikely to work well with MTP as it's excluding the top choice(s). Have you tried it without XTC? DRY shouldn't be much of an issue, but there possibly is some impact there as well.

…nd n_ubatch

feat: implemented sampling for MTP

…orruption

Stealt91 · 2025-10-04T15:57:36Z

Is work on this still progressing in the background? If not, then what kind of work still remains to be done? Is it mainly cleanup and refactoring? If so, could another developer try their hands at this? I am eagerly awaiting MTP support and it would be a shame to see no progress on this after an initial working implementation has been done.

I have no experience with llama.cpp in particular, but if it's "only" cleanup work that needs no deep knowledge of ML libraries, I might be able to support.

F1LM1 · 2025-10-05T02:43:53Z

Is work on this still progressing in the background? If not, then what kind of work still remains to be done? Is it mainly cleanup and refactoring? If so, could another developer try their hands at this? I am eagerly awaiting MTP support and it would be a shame to see no progress on this after an initial working implementation has been done.

I have no experience with llama.cpp in particular, but if it's "only" cleanup work that needs no deep knowledge of ML libraries, I might be able to support.

@SamuelOliveirads is still working on this in a PR on this branch, see F1LM1#3. It's primarily a refactor and some optimizations, but to make my crappy prototype work in a way that is reasonably maintainable/extensible is not that trivial. I've been pretty busy the last few weeks but I don't think we're too far off from a reasonable implementation

Stealt91 · 2025-10-05T10:25:56Z

That is great to hear! Thank you a lot for your effort so far! I had already guessed that it might not be possible to contribute without deeper knowledge, but was willing to give it a try anyway. It's great to hear that @SamuelOliveirads is already working on it!

…tion enum

…helper methods

… debug logs

mtp-batch: batch prompt processing

SamuelOliveirads · 2025-11-02T20:23:26Z

The latest commits successfully integrate the MTP into the llama_decode architecture, maintaining the good output quality from before.

The next major step is optimization. Please note: as of now, this branch is expected to be slower than the baseline without MTP. It should be considered a development preview, not a performance-ready feature.

To tackle the performance issues, a lot of work is needed, particularly in areas where I'm still building my expertise. I've gathered my detailed findings and ideas for optimization in a separate discussion here: F1LM1#4

This is an open invitation for anyone with experience in the following topics to take a look and share any insights. Any help would be greatly appreciated!

Graph reuse and caching strategies
GGML backend allocation and scheduling
Optimizing graph construction for batches with more than one token

F1LM1 added 2 commits August 10, 2025 23:52

added getter for nextn layer count and server slot has_mtp property

db60623

some work towards building mtp layer graph

e434f87

github-actions bot added examples server labels Aug 11, 2025

F1LM1 added 2 commits August 11, 2025 20:54

make nextn weights loadable without a crash

1f477b3

add model member function to build mtp graph, to be called from specu…

03231da

…lative.cpp

ggerganov added the hot Something that is hot label Aug 12, 2025

broad thrust of the mtp implementation

cf0f7c0

failed attempt to implement MTP; outputs tokens but KV cache manageme…

6e9bafc

…nt is unreasonable

added proper KV cache management for MTP layers and slightly refactored

6870f97

F1LM1 added 2 commits August 17, 2025 21:54

fixed mtp kv cache update sequencing after prompt processing

382135a

kludge-y kv cache management of mtp layer

d72f9d5

fixed vram leak

471e026

replace standard sampler with greedy sampler for mtp draft

98bc0c6

RatStar811 approved these changes Aug 29, 2025

View reviewed changes

F1LM1 and others added 2 commits September 2, 2025 17:14

fixed mtp kv cache update step in cases where prompt size > n_batch a…

9fab53e

…nd n_ubatch

feat: implemented sampling for MTP

07670a2

SamuelOliveirads and others added 3 commits September 3, 2025 17:56

fix: add sample acceptance

5a5bce8

feat: apply logits + greedy sampler

8742ce0

Merge pull request #1 from SamuelOliveirads/glm4-moe-mtp

c6237c7

feat: implemented sampling for MTP

pwilkin mentioned this pull request Sep 13, 2025

Feature Request: Qwen3-Next support #15940

Open

4 tasks

SamuelOliveirads added 6 commits September 14, 2025 10:22

mtp-batch (wip): move mtp execution to batch format

1318b2d

mtp-batch (wip): merge mtp and model graph

042eb8a

mtp-batch (wip): merge glm graphs

df64508

mtp-batch (fix): warm mtp cache for small batch size

3da7e7f

mtp-batch (wip): organize batch for mtp cache

75dc25e

mtp-batch (wip): Isolate MTP graph to prevent host embedding buffer c…

67c6c06

…orruption

SamuelOliveirads added 9 commits October 5, 2025 14:43

mtp-batch (wip): fix how to warmup kv cache for MTP

febd823

mtp-batch (feat): Create and manage sinfo for MTP

5e1d719

mtp-batch (fix): prevent mtp draft from polluting the cache

6f74ba3

mtp-batch(refactor): Replace MTP boolean flags with an explicit opera…

913af8f

…tion enum

mtp-batch(refactor): Extract decode context and MTP input logic into …

a99709d

…helper methods

mtp-batch(chore): Fix logit flags for speculative sampling and remove…

b4cbe03

… debug logs

mtp-batch(fix): Correctly advance cache head and add MTP documentation

4bcc9e2

mtp-batch(chore): Remove final MTP debug logs and dead code

0127c6b

mtp-batch(fix): avoid logits for mtp kv cache operations

cae85fe

SamuelOliveirads mentioned this pull request Oct 19, 2025

mtp-batch: batch prompt processing F1LM1/llama.cpp#3

Merged

Merge pull request #3 from SamuelOliveirads/glm4-mtp-batch

c2d7c76

mtp-batch: batch prompt processing

RatStar811 approved these changes Oct 30, 2025

View reviewed changes

RatStar811 approved these changes Nov 1, 2025

View reviewed changes

RatStar811 approved these changes Nov 3, 2025

View reviewed changes

server: implement GLM-style MTP #15225

Are you sure you want to change the base?

server: implement GLM-style MTP #15225

Uh oh!

Conversation

F1LM1 commented Aug 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ggerganov commented Aug 13, 2025

Uh oh!

ggerganov commented Aug 13, 2025

Uh oh!

slaren commented Aug 13, 2025

Uh oh!

Juahyori commented Aug 13, 2025

Uh oh!

F1LM1 commented Aug 14, 2025

Uh oh!

F1LM1 commented Aug 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

F1LM1 commented Aug 16, 2025

Uh oh!

F1LM1 commented Aug 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

F1LM1 commented Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ghost commented Aug 20, 2025

Uh oh!

F1LM1 commented Aug 20, 2025

Uh oh!

F1LM1 commented Aug 26, 2025

Uh oh!

Stealt91 commented Aug 26, 2025

Uh oh!

Stealt91 commented Oct 4, 2025

Uh oh!

F1LM1 commented Oct 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Stealt91 commented Oct 5, 2025

Uh oh!

SamuelOliveirads commented Nov 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

F1LM1 commented Aug 11, 2025 •

edited

Loading

F1LM1 commented Aug 16, 2025 •

edited

Loading

F1LM1 commented Aug 17, 2025 •

edited

Loading

F1LM1 commented Aug 19, 2025 •

edited

Loading

F1LM1 commented Oct 5, 2025 •

edited

Loading